Abstract. We propose the multiple LUT cascade as a means to configure an n-input
LPM (Longest Prefix Match) address generator commonly used in routers
to  determine  the  output  port  given  an  address.  The  LPM  address  generator  accepts 
n-bit  addresses  which  it  matches  against  k  stored  prefixes.  We  implement  our 
design  on  a  Xilinx  Spartan-3  FPGA  for  n  =  32  and  k  =  504  ~  511.  Also,  we 
compare  our  design  to  a  Xilinx  proprietary  TCAM  (ternary  content-addressable 
memory)  design  and  to  another  design  we  propose  as  a  likely  solution  to  this 
problem.  Our  best  multiple  LUT  cascade  implementation  has  5.20  times  more 
throughput,  31.71  times  more  throughput/area  and  is  2.89  times  more  efficient 
in  terms  of  area-delay  product  than  Xilinx’s  proprietary  design.  Furthermore,  its 
area  is  only  19%  of  Xilinx’s  design. 


1  Introduction 

The  need  for  higher  internet  speeds  is  likely  to  be  the  subject  of  intense  interest  for 
many  years  to  come.  A  network’s  speed  is  directly  related  to  the  speed  with  which  a 
node  can  switch  a  packet  from  an  input  port  to  an  output  port.  This,  in  turn,  depends 
on  how  fast  a  packet’s  address  can  be  accessed  in  memory.  The  longest  prefix  match 
(LPM)  problem  is  one  of  determining  the  output  port  address  from  a  list  of  prefix 
vectors  stored  in  memory.  For  example,  if  the  prefix  vector  01001****  is  stored  in 
memory,  then  the  packet  address  010011111  matches  this  entry.  That  is,  each  bit  in  the 
packet  address  matches  exactly  the  corresponding  digit  in  the  prefix  vector  or  there  is  a 
* or don't care in that digit. If other stored prefixes match the packet address, then the
prefix with the fewest don't care values determines the output port address. That is, the
memory  entry  corresponding  to  the  longest  prefix  match  determines  the  output  port. 

An  ideal  device  for  this  application  is  a  ternary  content-addressable  memory 
(TCAM). The descriptor "ternary" refers to the three values stored: 0, 1, and *. Unfortunately,
TCAM dissipates much more power than standard RAM [1].

Several authors have proposed the use of standard RAM in LPM design. Gupta,
Lin, and McKeown showed a mechanism to perform an LPM on every memory access [2].
Dharmapurikar, Krishnamurthy, and Taylor proposed the use of Bloom filters to solve
the LPM problem [3]. Sasao and Butler have shown a fast, power-efficient TCAM
realization using a look-up table (LUT) cascade [4].

In  this  paper,  we  propose  an  extension  to  the  LUT  cascade  realization:  a  multiple 
LUT  cascade  realization  that  consists  of  multiple  LUT  cascades  connected  to  a  special 
encoder. This offers even more efficient realizations in an architecture that is more easily
reconfigured  when  additional  prefix  vectors  are  placed  in  the  prefix  table. 

We have implemented six types of LPM address generators on the Xilinx Spartan-3
FPGA (XC3S4000-5): four different realizations using multiple LUT cascades, one
using  Xilinx’s  TCAM  realization  based  on  the  Xilinx  IP  core,  and  one  using  registers 
and  gates.  In  addition,  we  compare  the  six  types  of  LPM  address  generators  on  the  basis 
of  delay,  delay-area  product,  throughput,  throughput/area,  and  FPGA  resources  used. 

The  rest  of  the  paper  is  organized  as  follows:  Section  2  describes  the  multiple  LUT 
cascade.  Section  3  shows  other  realizations  for  the  LPM  address  generators.  Section  4 
presents  the  implementations  of  the  LPM  address  generator  using  an  FPGA.  Section  5 
shows  the  experimental  results.  Section  6  concludes  the  paper. 

2  Multiple  LUT  Cascades 

2.1  LPM  Address  Generators 

A content-addressable memory (CAM) [5] stores 0's and 1's and produces the address
of the given data. A TCAM, unlike a CAM, stores 0's, 1's, and *'s, where * is a don't
care value that matches both 0 and 1.

TCAMs are extensively used in routing tables for the internet. A routing table specifies
an interface identifier corresponding to the longest prefix that matches an incoming
packet, in a process called Longest Prefix Match (LPM). In the LPM table, the ternary
vectors have restricted patterns: the prefix consists of only 0's and 1's, and the postfix
consists of only *'s (don't cares). In this paper, this type of vector is called a prefix vector.

Definition 2.1 An n-input m-output k-entry LPM table stores k n-element prefix vectors
of the form $VEC_1 \cdot VEC_2$, where $VEC_1$ is a string of 0's and 1's, and $VEC_2$
is a string of *'s. To assure that the longest prefix address is produced, TCAM entries
are stored in descending prefix length, and the first match determines the LPM table's
output. An address is an m-element binary vector for $m = \lceil \log_2(k+1) \rceil$, where $\lceil a \rceil$
denotes the smallest integer greater than or equal to a. The corresponding LPM function
is a logic function $f : B^n \to B^m$, where $f(x)$ is the smallest address of an entry that is
identical to x except possibly for don't care values. If no such entry exists, $f(x) = 0^m$.
The LPM address generator is a circuit that realizes the LPM function.

Example  2.1  Table  1  shows  an  LPM  table  with  5  4-element  prefix  vectors.  Table  2 
shows  the  corresponding  LPM  function.  It  has  16  entries,  one  for  each  4-bit  input.  The 
output  address  is  stored  for  each  input  corresponding  to  the  address  of  the  longest  prefix 
vector  that  matches  it.  (End  of  Example) 
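As an illustration of Definition 2.1 and Example 2.1, the following Python sketch evaluates an LPM function by scanning a table stored in descending prefix length and returning the first match. The function names are hypothetical. Entries 1 to 3 are the prefix vectors of Table 1; entries 4 and 5 are assumptions made here (1*** and 0***), chosen only because they are consistent with the outputs listed in Table 2.

```python
def matches(prefix_vector: str, address: str) -> bool:
    """True if each digit of the prefix vector equals the address bit or is '*'."""
    return len(prefix_vector) == len(address) and all(
        p in ("*", a) for p, a in zip(prefix_vector, address)
    )


def lpm(table: list[str], address: str) -> int:
    """Smallest 1-based address of a matching entry, or 0 if no entry matches."""
    for index, prefix_vector in enumerate(table, start=1):
        if matches(prefix_vector, address):
            return index
    return 0


# Entries 1-3 come from Table 1; entries 4 and 5 are ASSUMED for this sketch.
TABLE_1 = ["1000", "010*", "01**", "1***", "0***"]

# Truth table in the style of Table 2: one output address per 4-bit input.
LPM_FUNCTION = {
    format(x, "04b"): lpm(TABLE_1, format(x, "04b")) for x in range(16)
}
```

Because the table is ordered by descending prefix length, the first match is also the longest match, so no explicit length comparison is needed.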


2.2  An  LUT  Cascade  Realization  of  LPM  Address  Generators 

An  LPM  function  can  be  realized  by  a  single  memory.  However,  this  often  requires 
prohibitively  large  memory  size.  We  propose  functional  decomposition  [6, 7]  to  realize 
the LPM function with lower storage requirements. For a given LPM function $f(x)$,


Table 1. LPM table

Address   Prefix Vector
1         1000
2         010*
3         01**
4
5


Table 2. LPM function

Input   Output Address     Input   Output Address
0000    5                  1000    1
0001    5                  1001    4
0010    5                  1010    4
0011    5                  1011    4
0100    2                  1100    4
0101    2                  1101    4
0110    3                  1110    4
0111    3                  1111    4

Fig. 1. Decomposition for the LPM function f (blocks H and G; #rails = $\lceil \log_2 \mu \rceil$)


Fig.  2.  LUT  cascade 


let x be partitioned as $(x_A, x_B)$. The decomposition chart of f is a table with $2^{n_A}$
columns and $2^{n_B}$ rows, where $n_A$ and $n_B$ are the number of variables in $x_A$ and $x_B$,
respectively. Each column and row is labeled by a binary number, and the corresponding
element in the table denotes the value of f. The column multiplicity, $\mu$, is the number of
different column patterns of the decomposition chart. Then, using functional decomposition,
the function f can be decomposed as $f(x_A, x_B) = G(H(x_A), x_B)$, as shown
in Fig. 1, where the number of rails (signal lines between two blocks H and G) is
$\lceil \log_2 \mu \rceil$. By iterative functional decomposition, the given function can be realized by
an LUT cascade, as shown in Fig. 2 [8, 9].
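The column multiplicity can be computed directly from its definition. The Python sketch below builds the columns of the decomposition chart and counts the distinct patterns. The function names and the example function (4-input parity) are hypothetical and serve only to illustrate the definition; this is not the decomposition procedure of [6, 7].

```python
from itertools import product


def column_multiplicity(f, n_a: int, n_b: int) -> int:
    """Count the distinct columns of the decomposition chart of f(x_A, x_B).

    Columns are indexed by the n_a bits of x_A, rows by the n_b bits of x_B.
    f takes one tuple of n_a + n_b bits and returns a hashable value.
    """
    columns = set()
    for x_a in product((0, 1), repeat=n_a):
        column = tuple(f(x_a + x_b) for x_b in product((0, 1), repeat=n_b))
        columns.add(column)
    return len(columns)


def rails(mu: int) -> int:
    """Number of rails between H and G: ceil(log2(mu)), for mu >= 1."""
    return (mu - 1).bit_length()


# Hypothetical example: 4-input parity with x_A = (x1, x2) and x_B = (x3, x4).
# Only the parity of x_A matters to G, so the chart has two column patterns.
MU_PARITY = column_multiplicity(lambda x: sum(x) % 2, n_a=2, n_b=2)
RAILS_PARITY = rails(MU_PARITY)
```

The exhaustive enumeration is exponential in the number of inputs, so it is only practical for small illustrative functions.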

Theorem 2.1 [4] An n-input LPM address generator with k prefix vectors can be
realized by an LUT cascade, where each cell realizes a p-input, r-output combinational
logic function. Let s be the necessary number of levels or cells. Then,

$$s \le \left\lceil \frac{n - r}{p - r} \right\rceil, \qquad (1)$$

where $p > r$ and $r = \lceil \log_2(k + 1) \rceil$.
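The bound of Theorem 2.1 is easy to evaluate numerically. The following Python sketch, with hypothetical function names, computes the number of rails and the upper bound on the number of levels using integer arithmetic only.

```python
def rails_needed(k: int) -> int:
    """r = ceil(log2(k + 1)) rails for k prefix vectors (k >= 0)."""
    # For a non-negative integer k, k.bit_length() equals ceil(log2(k + 1)).
    return k.bit_length()


def cascade_levels(n: int, p: int, r: int) -> int:
    """Upper bound ceil((n - r) / (p - r)) on the number of cells, from (1)."""
    if p <= r:
        raise ValueError("Theorem 2.1 requires p > r")
    return -(-(n - r) // (p - r))  # ceiling division on integers
```

For instance, `cascade_levels(32, 11, rails_needed(1000))` evaluates the bound for n = 32, k = 1000, and p = 11.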


2.3  LPM  Address  Generators  Using  the  Multiple  LUT  Cascade 

A  single  LUT  cascade  realization  of  an  LPM  function  often  requires  many  levels.  Since 
the  delay  is  proportional  to  the  number  of  levels  in  a  cascade,  we  wish  to  reduce  the 
number  of  levels.  According  to  (1),  if  we  increase  p,  the  number  of  inputs  to  each  cell, 
then  the  number  of  levels  s  is  reduced.  For  each  increase  by  1  of  p,  the  memory  needed 
to  realize  the  cell  is  doubled.  However,  as  shown  in  Fig.  3,  we  can  use  the  multiple  LUT 
cascade  to  reduce  the  number  of  levels  s  while  keeping  p  fixed.  For  an  n-input  LPM 
function  with  k  prefix  vectors,  let  the  number  of  rails  of  each  LUT  cascade  be  r.  First, 


partition the set of prefix vectors into g groups of $2^r - 1$ vectors each, except the last
group, which has $2^r - 1$ or fewer vectors, where $g = \lceil k / (2^r - 1) \rceil$. For each group of prefix
vectors, form an independent LPM function. Next, partition the set of n inputs into s
groups. Then, realize each LPM function by an LUT cascade. Thus, we need a total
of g LUT cascades, and each LUT cascade consists of s cells. Finally, use a special
encoder to produce the LPM address. Let $v_i$ $(i = 1, 2, \ldots, g)$ be the i-th input of the
special encoder, and let $v_{out}$ be the output value of the special encoder. That is, $v_i$ is
the output value of the i-th LUT cascade, where its binary output values are viewed as
a standard binary number. Similarly, $v_{out}$ is the output of the special encoder, where
its binary output values are viewed as a standard binary number. Then, we have the
relation:

$$v_{out} = \begin{cases} v_i + (i - 1)(2^r - 1) & \text{if } v_i \ne 0 \text{ and } v_j = 0 \text{ for all } 1 \le j \le i - 1, \\ 0 & \text{if } v_i = 0 \text{ for all } 1 \le i \le g. \end{cases}$$

Note that $v_{out}$ is the position of a prefix vector v in the complete LPM table, while i is
the index to the LUT cascade storing v. $(i - 1)(2^r - 1)$ is the position in the LPM table
of the last entry of the previous (i - 1)-th LUT cascade or is 0 in the case of the first
LUT cascade. Adding $v_i$ to this yields the position of v in the complete LPM table.
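The behavior of the special encoder can be modeled in a few lines. The Python sketch below is a behavioral model only (not a hardware description); the function names are hypothetical, and the cascade outputs are passed as a list ordered from the first cascade to the last.

```python
def num_groups(k: int, r: int) -> int:
    """g = ceil(k / (2^r - 1)): number of LUT cascades for k prefix vectors."""
    group_size = (1 << r) - 1
    return -(-k // group_size)


def special_encoder(v: list[int], r: int) -> int:
    """Map cascade outputs v[0..g-1] to the address in the complete LPM table.

    The first non-zero cascade output wins; a zero output means 'no match'.
    """
    group_size = (1 << r) - 1
    for i, v_i in enumerate(v, start=1):
        if v_i != 0:
            return v_i + (i - 1) * group_size
    return 0
```

Giving priority to the lowest-indexed cascade preserves the longest-prefix order, provided the groups are formed from consecutive entries of the table sorted by descending prefix length.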

Example 2.2 Consider an n-input LPM function with k prefix vectors. When k = 1000
and n = 32, by Theorem 2.1, we have r = 10. Let p = r + 1 = 11. When we use a
single LUT cascade to realize the function, by Theorem 2.1, we need $\lceil (32 - 10)/(11 - 10) \rceil = 22$
cells, and the number of levels of the LUT cascade is also 22. Since each cell has
11 address lines and 10 outputs, the total amount of memory needed to realize the
cascade is $2^{11} \times 10 \times 22 = 450,560$ bits. Note that the memory size of each cell,
$2^{11} \times 10 = 20,480$ bits, is too large to be realized by a single block RAM (BRAM) of
our FPGA, which stores 18,432 bits.

However,  if  we  use  a  multiple  LUT  cascade  to  realize  the  function,  we  can  reduce 
the  number  of  levels  and  the  total  amount  of  memory.  Also,  the  cells  will  fit  into  the 
BRAMs  in  the  FPGAs.  Partition  the  set  of  vectors  into  two  groups,  and  realize  each 
group  independently;  then,  we  need  two  LUT  cascades.  For  each  LUT  cascade,  the 
number of vectors is 500, so we have r = 9. Also, let p = r + 2 = 11. Then, we need
$\lceil (32 - 9)/(11 - 9) \rceil = 12$ cells in each cascade. Note that the number of levels of the LUT cascades
is 12, which is smaller than the 22 needed in the single LUT cascade realization. Since
each cell consists of a memory with 9 outputs and at most 11 address lines, the total
amount of memory is at most $2^{11} \times 9 \times 12 \times 2 = 442,368$ bits. Also, note that the size
of the memory for a single cell is $2^{11} \times 9 = 18,432$ bits. This fits exactly in the BRAMs
of the FPGAs.

Thus,  the  multiple  LUT  cascade  not  only  reduces  the  number  of  levels  and  the  total 
amount  of  memory,  but  also  adjusts  the  size  of  cells  to  fit  into  the  available  memory  in 
the  FPGAs.  (End  of  Example) 
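The memory figures of Example 2.2 can be checked with a short calculation. The Python sketch below uses a hypothetical helper name and assumes, as in the example, that every cell has the full p address lines (an upper bound for the last cell).

```python
BRAM_BITS = 18_432  # capacity of one BRAM, as stated in Example 2.2


def cascade_memory_bits(p: int, r: int, levels: int, cascades: int = 1) -> int:
    """Total memory of `cascades` LUT cascades, each with `levels` cells
    of 2^p words by r bits."""
    return (1 << p) * r * levels * cascades


# Single LUT cascade: r = 10, p = 11, 22 levels.
SINGLE_TOTAL = cascade_memory_bits(p=11, r=10, levels=22)
SINGLE_CELL = cascade_memory_bits(p=11, r=10, levels=1)

# Multiple LUT cascade: two cascades with r = 9, p = 11, 12 levels each.
MULTIPLE_TOTAL = cascade_memory_bits(p=11, r=9, levels=12, cascades=2)
MULTIPLE_CELL = cascade_memory_bits(p=11, r=9, levels=1)
```

Comparing `SINGLE_CELL` and `MULTIPLE_CELL` against `BRAM_BITS` shows why only the r = 9 cell maps onto a single BRAM.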

Fig.  3  shows  the  architecture  of  the  multiple  LUT  cascade.  The  realization  with 
this  architecture  is  the  multiple  LUT  cascade  realization.  It  consists  of  a  group  of 
LUT  cascades  and  a  special  encoder.  The  inputs  of  each  LUT  cascade  are  common 
with  other  LUT  cascades,  while  the  outputs  of  each  LUT  cascade  are  connected  to  the 


Fig.  3.  Architecture  of  the  multiple  LUT  cascade 


Fig.  4.  Detailed  design  of  the  LUT  cascade 


special  encoder.  Each  LUT  cascade  realizes  an  LPM  function,  while  the  special  encoder 
generates  the  LPM  address  from  the  outputs  of  cascades. 

For an n-input LPM function with k prefix vectors, the detailed design of the LUT
cascade is shown in Fig. 4, where $x_i$ $(i = 1, 2, \ldots, s)$ denotes the primary inputs to the
i-th cell, $d_i$ $(i = 1, 2, \ldots, s)$ denotes the data inputs to the i-th cell and provides the data
value to be written in the RAM of the i-th cell, r denotes the number of rails, where
$r \le \lceil \log_2(k + 1) \rceil$, and $c_j$ $(j = 2, 3, \ldots, s)$ denotes the additional inputs to the j-th cell and
is used to select the RAM location along with $x_j$ for write access. Note that $c_j$ and $d_i$
are represented by r bits. All RAMs except perhaps the last one have p address lines;
the last RAM has at most p address lines. When WE is high, $c_j$ is connected to the
RAM to write the data into the RAMs. When WE is low, the outputs of the RAMs are
connected to the inputs of the succeeding RAMs, and the circuit works as a cascade to


Table  3.  6-entry  LPM  table 


Table  4.  Truth  table  for  the  corresponding  LPM  function 


(a)  Single  LUT  cascade  realization  (b)  Multiple  LUT  cascade  realization 


Fig.  5.  Single  LUT  cascade  realization  and  the  multiple  LUT  cascade  realization 


realize  the  LPM  function.  Note  that  the  RAMs  are  synchronous  RAMs.  Therefore,  the 
LUT  cascade  resembles  a  shift  register. 

Example 2.3 Table 3 shows a 6-input 3-output 6-entry LPM table, and the corresponding
LPM function is shown in Table 4. Note that the entries in the two tables are similar.
Table 4 is a compact truth table, showing only non-zero outputs. Its input combinations
are disjoint. Thus, the two tables are the same except for three entries.

Single Memory Realization: The number of address lines is 6, and the number of
outputs is 3. Thus, the total amount of memory is $2^6 \times 3 = 192$ bits.

Single LUT Cascade Realization: Since there are k = 6 prefix vectors of the
function, by Theorem 2.1, the number of rails is $r = \lceil \log_2(6 + 1) \rceil = 3$. Let the
number of address lines for the memory in a cell be p = 4. By partitioning the inputs
into three disjoint sets $\{x_1, x_2, x_3, x_4\}$, $\{x_5\}$, and $\{x_6\}$, we have the cascade in Fig. 5
(a), where only the signal lines for cascade realization are shown, and other lines such
as for storing data are omitted for simplicity.

The total amount of memory is $2^4 \times 3 \times 3 = 144$ bits, and the number of levels
is s = 3. Note that the single LUT cascade requires 75% of the memory needed in the
single memory realization.


Table 5. Truth tables for the cells in the multiple LUT cascade realization

Cell 1 and Cell 2 (upper LUT cascade)

x1 x2 x3 x4    y1 y2    x5 x6    z1 z2    v1    vout
1  0  0  0     0  0     0  0     0  1     1     001
1  0  0  1     0  1     0  *     1  0     2     010
1  0  1  0     1  0     *  *     1  1     3     011
Other values   1  1     *  *     0  0     0     †
               Other values      0  0     0     †

Cell 3 and Cell 4 (lower LUT cascade)

x1 x2 x3 x4    y3 y4    x5 x6    z3 z4    v2    vout
1  0  1  1     0  0     *  *     0  1     1     100
1  0  0  *     0  1     *  *     1  0     2     101
1  1  *  *     1  0     *  *     1  1     3     110
Other values   1  1     *  *     0  0     0     †
               Other values      0  0     0     †

† depends on values from the other LUT cascade


Multiple LUT Cascade Realization: Partition Table 3 into two parts, each with
three prefix vectors. The number of rails in the LUT cascades associated with each separate
LPM table is $\lceil \log_2(3 + 1) \rceil = 2$. Let the number of address lines for the memory
in a cell be p = 4. By partitioning the inputs into two disjoint sets $\{x_1, x_2, x_3, x_4\}$ and
$\{x_5, x_6\}$, we obtain the realization in Fig. 5 (b). The upper LUT cascade realizes the
upper part of Table 4, while the lower LUT cascade realizes the lower part of
Table 4. The contents of each cell are shown in Table 5.

Let $v_1$ be the output value of the upper LUT cascade, let $v_2$ be the output value of
the lower LUT cascade, and let $v_{out}$ be the output value of the special encoder. Then,
in Table 5, $(z_1, z_2)$, viewed as a standard binary number, has value $v_1$, while $(z_3, z_4)$,
viewed as a standard binary number, has value $v_2$. The special encoder generates the
LPM address from the pair of outputs, $(z_1, z_2)$ and $(z_3, z_4)$:

$$out_2 = \bar{z}_1 \bar{z}_2 (z_3 \lor z_4),$$
$$out_1 = z_1 \lor \bar{z}_2 z_3 z_4,$$
$$out_0 = z_2 \lor \bar{z}_1 z_3 \bar{z}_4.$$

Note that $(out_2, out_1, out_0)$, viewed as a standard binary number, has value $v_{out}$ corresponding
to the address in Table 3. The total amount of memory is $2^4 \times 2 \times 4 = 128$
bits, and the number of levels is 2. Note that the multiple LUT cascade realization requires
89% of the memory and one fewer level than the single LUT cascade realization.

(End  of  Example) 
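The logic expressions of the special encoder in Example 2.3 can be checked exhaustively against the arithmetic relation of Section 2.3 with r = 2 (so that the offset of the second cascade is $2^2 - 1 = 3$). In the Python sketch below, the placement of the complemented literals is this sketch's reading of the expressions, and the function names are hypothetical.

```python
from itertools import product


def encoder_logic(z1: int, z2: int, z3: int, z4: int) -> int:
    """Special encoder of Example 2.3 as two-level logic (complements assumed)."""
    n1, n2, n4 = 1 - z1, 1 - z2, 1 - z4
    out2 = n1 & n2 & (z3 | z4)
    out1 = z1 | (n2 & z3 & z4)
    out0 = z2 | (n1 & z3 & n4)
    return 4 * out2 + 2 * out1 + out0


def encoder_arithmetic(z1: int, z2: int, z3: int, z4: int) -> int:
    """Same encoder from the relation v_out = v_i + (i - 1)(2^r - 1), r = 2."""
    v1 = 2 * z1 + z2
    v2 = 2 * z3 + z4
    if v1 != 0:
        return v1
    if v2 != 0:
        return v2 + 3
    return 0


# Input combinations on which the two descriptions disagree.
MISMATCHES = [
    z for z in product((0, 1), repeat=4)
    if encoder_logic(*z) != encoder_arithmetic(*z)
]
```

An empty `MISMATCHES` list indicates that the two descriptions agree on all 16 input combinations.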


3  Other  Realizations 

3.1  Xilinx’s  TCAM 

Xilinx [10] provides a proprietary realization of a TCAM that is produced by the Xilinx
CORE Generator tool [11]. Since a TCAM can directly realize an LPM address
generator, we compare our proposed multiple LUT cascade realization with Xilinx's
TCAM. In the Xilinx CORE Generator 7.1i, we used the following parameters to produce
TCAMs.


- SRL16 implementation.

- Standard Ternary Mode: Generate a standard ternary CAM.

- Depth: Number of words (vectors) stored in the TCAM: k.

- Data width: Width of the data word (vector) stored in the TCAM: n.

- Match Address Type: Three options: Binary Encoded, Single-match Unencoded,
and Multi-match Unencoded. We used the Binary Encoded option.

- Address Resolution: Lowest or Highest. We used the Lowest option.


Fig. 6. Realize the address generator with registers and gates

3.2  Registers  and  Gates 

We also compare our proposed multiple LUT cascade realization with a direct realization
using registers and gates, as shown in Fig. 6. We use a register pair (Reg. 1 and Reg.
0) to store each digit of a ternary vector. For example, if the digit is * (don't care), the
register pair stores (1,1). Thus, for n bit data, we need a 2n-bit register. The comparison
circuit consists of an n-input AND gate and n 1-bit comparison circuits, each of which
produces a 1 if and only if the input bit matches the stored bit or the stored bit is don't
care (* or 11).

For each prefix vector of an n-input LPM address generator, we need a 2n-bit register,
n copies of 1-bit comparison circuits, and an n-input AND gate. For an n-input
address generator with k registered prefix vectors, we need k copies of 2n-bit registers,
nk copies of 1-bit comparison circuits, and k copies of n-input AND gates. In addition,
we need a priority encoder with k inputs and $\lceil \log_2(k + 1) \rceil$ outputs to generate the
LPM address. If the n-input AND gate is realized as a cascade of 2-input AND gates,
this circuit can be considered as a special case of the multiple LUT cascade architecture,
where r = 1, p = 2, and g = k. Note that the output encoder circuit is a standard
priority encoder.
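The registers-and-gates realization can be modeled behaviorally as follows. In this Python sketch, only the encoding of * as (1,1) is taken from the text; the encodings of 0 as (0,1) and 1 as (1,0) are assumptions (Reg. 1 set means "a 1 input matches", Reg. 0 set means "a 0 input matches"), and all function names are hypothetical.

```python
def encode_digit(digit: str) -> tuple[int, int]:
    """Register pair (Reg. 1, Reg. 0) for one ternary digit.

    '*' -> (1, 1) follows the text; the codes for '0' and '1' are ASSUMED.
    """
    return {"0": (0, 1), "1": (1, 0), "*": (1, 1)}[digit]


def bit_matches(reg1: int, reg0: int, x: int) -> int:
    """1-bit comparison circuit: 1 iff the input bit matches or is don't care."""
    return reg1 if x == 1 else reg0


def entry_matches(registers: list[tuple[int, int]], bits: list[int]) -> bool:
    """n-input AND over the n 1-bit comparison circuits of one prefix vector."""
    return all(bit_matches(r1, r0, x) for (r1, r0), x in zip(registers, bits))


def priority_encoder(match_lines: list[bool]) -> int:
    """Smallest 1-based index of an asserted match line, or 0 if none."""
    for index, matched in enumerate(match_lines, start=1):
        if matched:
            return index
    return 0


def reg_gates_lpm(table: list[str], address: str) -> int:
    """LPM address from k parallel comparisons followed by a priority encoder."""
    bits = [int(c) for c in address]
    match_lines = [
        entry_matches([encode_digit(d) for d in vector], bits) for vector in table
    ]
    return priority_encoder(match_lines)
```

In hardware, all k comparisons operate in parallel; the list comprehension here only models that behavior sequentially.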

4  FPGA  Implementations 

We implemented the LPM address generators for 32 inputs and 504 to 511 registered
prefix vectors on Xilinx Spartan-3 FPGAs (XC3S4000-5) [12] by using the multiple


Table 6. Four Multiple LUT Cascade Realizations

Design   Number of prefix vectors   r   p    Group   Level
r6p11    504                        6   11   8       6
r7p11    508                        7   11   4       7
r8p11    510                        8   11   2       8
r9p11    511                        9   11   1       12

r: Number of rails
p: Number of address lines of the RAM in a cell
Group: Number of LUT cascades


Fig. 7. Architecture of r8p11


LUT cascade, Xilinx CORE Generator 7.1i, and registers & gates. The FPGA device
XC3S4000-5 has 96 BRAMs and 27648 slices. Each BRAM contains 18K bits, and
each slice consists of two 4-input LUTs, two D-type flip-flops, and multiplexers. For
each implementation, we described the circuit in Verilog HDL, and then used Xilinx
ISE 7.1i to synthesize and to perform place and route.

First, we used the multiple LUT cascade to realize the LPM address generators. To
use the BRAMs in the FPGA efficiently, the memory size of a cell in the LUT cascade
should not exceed the BRAM size. Let p be the number of address lines of the memory in
the cell. Since each BRAM contains $2^{11} \times 9$ bits, we have the relation $2^p \cdot r \le 2^{11} \times 9$,
where r is the number of rails. Thus, we have $p = \lfloor \log_2(9/r) \rfloor + 11$, where $\lfloor a \rfloor$ denotes
the largest integer less than or equal to a.

We designed four kinds of LPM address generators r6p11, r7p11, r8p11, and
r9p11, as shown in Table 6, where the column Number of prefix vectors denotes the
number of registered prefix vectors, the column r denotes the number of rails, the column
p denotes the number of address lines of the RAM in a cell, the column Group
denotes the number of LUT cascades, and the column Level denotes the number of
levels or cells in the LUT cascade.
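The parameters of the four designs follow from r and the number of prefix vectors. The Python sketch below, with hypothetical names, derives p from the BRAM-size constraint, the number of groups from the group size $2^r - 1$, and the number of levels from Theorem 2.1. The BRAM count assumes one BRAM per cell, which is an assumption of this sketch.

```python
N_INPUTS = 32
BRAM_BITS = (1 << 11) * 9  # each BRAM holds 2^11 x 9 bits


def address_lines(r: int) -> int:
    """Largest p with 2^p * r <= BRAM_BITS, i.e. floor(log2(9 / r)) + 11."""
    return (BRAM_BITS // r).bit_length() - 1


def design_row(k: int, r: int) -> dict:
    """Derive one row of design parameters from k prefix vectors and r rails."""
    p = address_lines(r)
    groups = -(-k // ((1 << r) - 1))          # ceil(k / (2^r - 1))
    levels = -(-(N_INPUTS - r) // (p - r))    # ceil((n - r) / (p - r))
    return {
        "design": f"r{r}p{p}",
        "k": k,
        "r": r,
        "p": p,
        "group": groups,
        "level": levels,
        "bram": groups * levels,  # ASSUMES one BRAM per cell
    }


DESIGNS = [design_row(k, r) for k, r in [(504, 6), (508, 7), (510, 8), (511, 9)]]
```

Printing `DESIGNS` row by row gives a quick consistency check of the Group and Level columns.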

To explain Table 6, consider r8p11, which is shown in Fig. 7. For r8p11, since the
number of rails is r = 8, the number of groups is $\lceil 510 / (2^8 - 1) \rceil = 2$. Thus, we need two LUT
cascades. Since each LUT cascade consists of 8 cells, the number of levels of r8p11 is
8. To efficiently use BRAMs in the FPGA, the number of address lines of the RAM in
the cell is set to $p = \lfloor \log_2(9/8) \rfloor + 11 = 11$. Let $v_1$ be the value of the outputs of the
upper LUT cascade, let $v_2$ be the value of the outputs of the lower LUT cascade, and
let $v_{out}$ be the value of the outputs of the special encoder. Then, we have the relation:

$$v_{out} = \begin{cases} v_2 + 255 & \text{if } v_1 = 0 \text{ and } v_2 \ne 0, \\ v_1 & \text{otherwise.} \end{cases}$$

This expression requires 11 slices to implement on the FPGA. After synthesizing and
mapping, r8p11 required 16 BRAMs and 69 slices. From Table 6, we can see that
decreasing r increases the number of groups but decreases the number of levels.

Next, we used the Xilinx CORE Generator 7.1i tool to produce Xilinx's TCAM.
Since the Xilinx CORE Generator 7.1i does not support TCAMs with 32 inputs and
505 to 511 registered prefix vectors, we designed a TCAM with 32 inputs and 504 registered
prefix vectors. After synthesizing and mapping, the resulting TCAM required
8,590 slices. Note that Xilinx's TCAM requires one clock cycle to find a match.

Finally, we designed the LPM address generator with n = 32 inputs and k = 511
registered prefix vectors using registers and gates, as shown in Fig. 6. This design is
denoted Reg-Gates. Note that the number of inputs is 32 and the number of outputs is
9. After synthesizing and mapping, this design required 27,646 slices.


5  Performance  and  Comparisons 


In Table 7, we show the performance of multiple LUT cascade realizations (i.e., r6p11,
r7p11, r8p11, and r9p11), and compare them with Xilinx's TCAM and Reg-Gates. In
Table 7, the column Level denotes the number of levels or cells in the LUT cascade,
the column Slice denotes the number of occupied slices, the column Memory denotes
the amount of memory required, and the column F_clk denotes the maximum clock
frequency. The column tco denotes the maximum clock-to-output propagation delay. (It is
the maximum time required to obtain a valid output at an output pin that is fed by a register
after a clock signal transition on an input pin that clocks the register.) The column tpd
denotes the maximum propagation time from the inputs to the outputs. The column Th.
denotes the maximum throughput. Since the LPM address generator has 9 outputs, it is
calculated by:

$$\mathrm{Th.} = 9 \cdot F_{clk}.$$


For Reg-Gates, Delay denotes the maximum delay from the input to the output and is
equal to tpd. For multiple LUT cascade realizations and Xilinx's TCAM, Delay denotes
the total delay, and is calculated by:

$$\mathrm{Delay} = \frac{1000 \cdot \mathrm{Level}}{F_{clk}} + t_{co},$$

where 1000 is a unit conversion factor.
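The throughput and delay formulas can be evaluated as in the following Python sketch. The function names are hypothetical; the numerical inputs used in the usage note are the F_clk, Level, and tco values reported for r6p11 in Table 7.

```python
def throughput_mbps(f_clk_mhz: float, outputs: int = 9) -> float:
    """Th. = (number of outputs) * F_clk, in Mbps for F_clk in MHz."""
    return outputs * f_clk_mhz


def total_delay_ns(level: int, f_clk_mhz: float, t_co_ns: float) -> float:
    """Delay = 1000 * Level / F_clk + tco, in ns for F_clk in MHz.

    The factor 1000 converts 1/MHz (microseconds) to nanoseconds.
    """
    return 1000 * level / f_clk_mhz + t_co_ns


# Values reported for r6p11: Level = 6, F_clk = 103.89 MHz, tco = 24.89 ns.
R6P11_THROUGHPUT = throughput_mbps(103.89)
R6P11_DELAY = total_delay_ns(6, 103.89, 24.89)
```

Rounded to the precision of Table 7, these evaluate to 935 Mbps and 82.64 ns.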

Consider the area occupied by the various realizations. From the Spartan-3 family
architecture [12], we can see that the area of one BRAM is at least the area of 16 slices
(a slice consists of two "4-input LUTs", two flip-flops, and miscellaneous multiplexers).

An alternative estimate shows that the area of one BRAM is equivalent to that of
96 slices, as follows. In the Xilinx Virtex-II FPGA, one "4-input LUT" occupies approximately
the same area as 96 bits of BRAM (also containing 18K bits) [13]. Note


Table 7. Comparisons of FPGA implementations of the LPM address generator

Design          Level  Slice  Memory  F_clk   tco/tpd      Th.          Area         Th./Area      Delay         Area-Delay
                              (BRAM)  (MHz)   (ns)         (Mbps)       (slice)      (Mbps/slice)  (ns)          (slice-ns)
r6p11           6      178    48      103.89  24.89 (tco)  935          4786         0.195         82.64         395.53
r7p11           7      116    28      113.77  23.46 (tco)  1024         2804         0.365         84.99         238.31
r8p11           8      69     16      139.93  20.91 (tco)  1259 (best)  1605         0.785         79.57         127.71
r9p11           12     99     12      139.08  13.72 (tco)  1252         1251 (best)  1.001 (best)  100.00        125.10 (best)
Xilinx's TCAM   1      8590   -       22.52   13.48 (tco)  203          8590         0.024         57.88 (best)  497.23
Reg-Gates       -      27646  -       -       58.67 (tpd)  -            27646        -             58.67         1621.99

Area: We assume that the area for one BRAM is equivalent to the area for 96 slices


that both "4-input LUTs" and BRAMs of the Virtex-II FPGA are similar to those of the
Spartan-3 FPGA. Thus, we can deduce that one BRAM of the Spartan-3 FPGA occupies
about the same area as 192 (= 18 x 1024/96) "4-input LUTs". If we view one
"4-input LUT" as approximately one-half a slice according to our discussion in the previous
paragraph, we conclude that one BRAM has about the same area as 96 (= 192/2)
slices. Thus, estimates of the area for one BRAM range from 16 to 96
slices. For this analysis a worst case of 96 slices/BRAM was used.

In  Table  7,  the  column  Area  denotes  the  equivalent  utilized  area,  where  the  area 
for  one  BRAM  is  equivalent  to  the  area  for  96  slices.  The  column  Th./Area  denotes 
the  efficiency  of  throughput  per  area  for  one  slice.  The  column  Area-Delay  denotes  the 
area-delay  product.  The  value  denoted  by  best  shows  the  best  result. 
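The derived columns of Table 7 can be recomputed from the Slice, Memory, throughput, and Delay values. The Python sketch below uses hypothetical function names and the worst-case equivalence of 96 slices per BRAM. The Area-Delay entries of Table 7 appear to be expressed in thousands of slice-ns; that scaling is an inference of this sketch, and the function returns plain slice-ns.

```python
BRAM_BITS = 18 * 1024
BITS_PER_LUT_AREA = 96                                    # one 4-input LUT ~ 96 BRAM bits [13]
LUTS_PER_BRAM = BRAM_BITS // BITS_PER_LUT_AREA            # 192 4-input LUTs
SLICES_PER_BRAM = LUTS_PER_BRAM // 2                      # two LUTs per slice -> 96 slices


def equivalent_area(slices: int, brams: int) -> int:
    """Equivalent utilized area in slices, counting each BRAM as 96 slices."""
    return slices + SLICES_PER_BRAM * brams


def throughput_per_area(throughput_mbps: float, area_slices: int) -> float:
    """Throughput efficiency in Mbps per slice."""
    return throughput_mbps / area_slices


def area_delay(area_slices: int, delay_ns: float) -> float:
    """Area-delay product in slice-ns."""
    return area_slices * delay_ns


# r8p11: 69 slices and 16 BRAMs; throughput 9 * 139.93 Mbps.
R8P11_AREA = equivalent_area(69, 16)
R8P11_EFFICIENCY = throughput_per_area(9 * 139.93, R8P11_AREA)
```

Applying `area_delay` to r9p11 (area 1251, delay 100.00 ns) gives 125100 slice-ns, consistent with the 125.10 listed in Table 7 under the scaling assumed above.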

Xilinx’s  TCAM  has  the  smallest  delay,  but  requires  many  slices.  Reg-Gates  has 
almost  the  same  delay  as  Xilinx’s  TCAM,  but  requires  about  three  times  as  many  slices 
as  Xilinx’s  TCAM.  Note  that  Reg-Gates  requires  no  clock  pulses  in  the  LPM  address 
generation  operation,  while  the  others  are  sequential  circuits  that  require  clock  pulses. 
Since  the  delay  of  Reg-Gates  is  58.67  ns,  the  equivalent  throughput  is  (1000/58.67)  x 
9  =  153  (Mbps),  which  is  lower  than  all  others. 

All  multiple  LUT  cascade  realizations  have  higher  throughput,  smaller  area,  higher 
throughput/area,  and  are  more  efficient  in  terms  of  area-delay  than  Xilinx’s  TCAM. 
r9p11 has the smallest area and the highest throughput/area, and is the most efficient in terms
of area-delay, but has the largest delay. r8p11 has the highest throughput, and has the
smallest delay among all multiple LUT cascade realizations. Furthermore, in terms of
area-delay, r8p11 has almost the same performance as r9p11. Thus, r8p11 is the best
multiple LUT cascade realization: it has 5.20 times more throughput, 31.71 times
more throughput/area, and is 2.89 times more efficient in terms of area-delay product
than Xilinx's TCAM, while the area is only 19% of Xilinx's TCAM.
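The comparison figures for r8p11 can be related to Table 7 as follows. In this Python sketch, "x times more" is read as the ratio of the two rounded Table 7 entries minus one; this reading is an inference that happens to reproduce the quoted figures, and the function name is hypothetical.

```python
def times_more(ours: float, reference: float) -> float:
    """Improvement expressed as (ours / reference) - 1 (ASSUMED reading)."""
    return ours / reference - 1


# Rounded Table 7 entries for r8p11 versus Xilinx's TCAM.
THROUGHPUT_GAIN = times_more(1259, 203)          # throughput (Mbps)
EFFICIENCY_GAIN = times_more(0.785, 0.024)       # throughput/area (Mbps/slice)
AREA_DELAY_GAIN = times_more(497.23, 127.71)     # smaller area-delay is better
AREA_PERCENT = 100 * 1605 / 8590                 # area relative to Xilinx's TCAM
```

Rounded to two decimals, the three gains are 5.20, 31.71, and 2.89, and the relative area rounds to 19%.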


6  Conclusions 


In  this  paper,  we  presented  the  multiple  LUT  cascade  to  realize  LPM  address  generators. 
Although  we  illustrated  the  design  method  for  n  =  32  and  k  =  504  ~  511,  it  can  be 
extended  to  any  value  of  n  and  k. 

We implemented four kinds of LPM address generators (i.e., r6p11, r7p11, r8p11,
and r9p11) on the Xilinx Spartan-3 FPGA (XC3S4000-5) by using the multiple LUT
cascade. For comparison, we also implemented Xilinx's proprietary TCAM, and Reg-Gates
by using registers and gates on the same type of FPGA. Xilinx's TCAM has
the smallest delay, but requires many slices. Reg-Gates has almost the same delay as
Xilinx's TCAM, but requires the largest area, about three times as many
slices as Xilinx's TCAM. All multiple LUT cascade realizations have higher throughput,
smaller area, and higher throughput/area, and are more efficient in terms of area-delay product
than Xilinx's TCAM.